Use helpers in your recordExtractor to extract relevant content from your pages more easily. Algolia provides these helpers:
  • product
  • article
  • page
  • splitContentIntoRecords
  • codeSnippets
  • docsearch

product

This helper extracts content from product pages. A “product page” is an HTML page with one of the product JSON-LD schema types:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.product({ url, $ });
}

Response

The helper returns an object with the following properties:
objectID
string
The product page’s URL.
url
string
The product page’s URL (without parameters or hashes).
lang?
string
The language the page content is written in (from the HTML lang attribute).
name
string
The name field of the JSON-LD schema.
sku
string
The sku field of the JSON-LD schema.
description?
string
The description field of the JSON-LD schema.
image?
string
The image field of the JSON-LD schema.
price?
string
The product’s price, taken from the first of these JSON-LD schema fields, in this order:
  1. offers.price
  2. offers.highPrice
  3. offers.lowPrice
currency?
string
The offers.priceCurrency field of the JSON-LD schema.
category?
string
The category field of the JSON-LD schema.
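For illustration, a record produced by this helper might look like the following. The URL and all field values here are invented examples, not output from a real crawl:

```javascript
// Illustrative shape of a record returned by helpers.product.
// All values are hypothetical examples.
const record = {
  objectID: 'https://example.com/products/trail-shoe',
  url: 'https://example.com/products/trail-shoe',
  lang: 'en',
  sku: 'SHOE-123',
  description: 'A lightweight trail running shoe.',
  image: 'https://example.com/images/trail-shoe.jpg',
  price: '79.99',
  currency: 'USD',
  category: 'Footwear',
};
```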

article

This helper extracts content from article pages. An “article page” is an HTML page with an appropriate JSON-LD schema type or meta tag:
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.article({ url, $ });
}
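As with the other helpers, you can modify the returned record before indexing it. A minimal sketch (the source attribute is a hypothetical custom field, not part of the helper's output):

```javascript
recordExtractor: ({ url, $, helpers }) => {
  const record = helpers.article({ url, $ });
  // Hypothetical: tag every article record with a custom attribute.
  return { ...record, source: 'blog' };
}
```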

Response

The helper returns an object with the following properties:
objectID
string
The article’s URL.
url
string
The article’s URL (without parameters or hashes).
lang?
string
The language the article is written in (from the HTML lang attribute).
headline
string
The article’s headline, taken from the first of these, in this order:
  1. meta[property="og:title"]
  2. meta[name="twitter:title"]
  3. head > title
  4. The first <h1>
description?
string
The article’s description, taken from the first of these, in this order:
  1. meta[name="description"]
  2. meta[property="og:description"]
  3. meta[name="twitter:description"]
keywords
string array
The keywords field of the JSON-LD schema.
tags
string array
Article tags: meta[property="article:tag"].
image?
string
The image associated with the article, taken from the first of these, in this order:
  1. meta[property="og:image"]
  2. meta[name="twitter:image"]
authors?
string array
The author field of the JSON-LD schema.
datePublished?
string
The datePublished field of the JSON-LD schema.
dateModified?
string
The dateModified field of the JSON-LD schema.
category?
string
The category field of the JSON-LD schema.
content
string
The article’s content (body copy).

page

This helper extracts text from any page, regardless of its type or category.
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      content: 'body',
    },
  });
}

Response

The helper returns an object with the following properties:
objectID
string
The object’s unique identifier.
url
string
The page’s URL.
hostname
string
The URL hostname (for example, example.com).
path
string
The URL path: everything after the hostname.
depth
number
The URL depth, based on the number of slashes after the domain. For example, http://example.com/ = 1, http://example.com/about = 1, http://example.com/about/ = 2.
fileType
file type
The page’s file type. One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, or email.
contentLength
number
The page length in bytes.
title?
string
The page title, derived from head > title.
description?
string
The page’s description, derived from meta[name="description"].
keywords?
string array
The page’s keywords, derived from meta[name="keywords"].
image?
string
The image associated with the page, derived from meta[property="og:image"].
headers?
string array
The page’s section titles, derived from h1 and h2.
content
string
The page’s content (body copy).
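The URL-derived properties above (hostname, path, depth) can be illustrated with JavaScript's standard URL class. This is an illustration of the definitions, not the crawler's internal code:

```javascript
// hostname, path, and depth for http://example.com/about/
const { hostname, pathname } = new URL('http://example.com/about/');
const depth = (pathname.match(/\//g) || []).length; // slashes after the domain
// hostname: 'example.com', pathname: '/about/', depth: 2
```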

splitContentIntoRecords

This helper extracts text from long HTML pages and splits them into smaller chunks. This can help prevent “Record too big” errors. Using this example record extractor on a long page returns an array of records, each one smaller than 1,000 bytes.
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // Produced records can be modified after creation, if necessary.
  return records;
}
When splitting pages, some words will appear in several records belonging to the same page. To keep these duplicates out of your users’ search results:
  • Set distinct to true in your index settings: distinct: true.
  • Set attributeForDistinct to your page’s URL. For example, attributeForDistinct: 'url'.
  • Set searchableAttributes to your page title and body content. For example, searchableAttributes: [ 'title', 'text' ].
  • Add a customRanking to sort from the first split record on your page to the last. For example, customRanking: [ 'asc(part)' ].
JavaScript
initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Parameters

Specify one or more of these parameters in your helper to control how records are split.
baseRecord
record
default:"{}"
Takes this record’s attributes (and values) and adds them to all the split records.
$elements
string
default:"$('body')"
A Cheerio selector that determines which elements to extract content from. For more information, see Extracting data with Cheerio.
maxRecordBytes
number
default:"10000"
Maximum number of bytes allowed per record. To avoid errors, check your plan’s record size limits.
orderingAttributeName
string
Name of the attribute in which to store each record’s sequential number, assigned when the helper splits a page.
textAttributeName
string
default:"text"
Name of the attribute in which to store the text of each split record.
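To make the splitting behavior concrete, here is a toy sketch of the idea: pack words into records no larger than maxRecordBytes, numbering each part. This is a simplified illustration (for one thing, it ignores the size of baseRecord itself), not the helper's actual implementation:

```javascript
// Toy word-packing splitter: an illustration, not the helper's implementation.
function splitIntoRecords(baseRecord, text, maxRecordBytes, textAttr = 'text', orderAttr = 'part') {
  const records = [];
  let current = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    // Start a new record when adding the next word would exceed the limit.
    if (current && Buffer.byteLength(candidate, 'utf8') > maxRecordBytes) {
      records.push({ ...baseRecord, [textAttr]: current, [orderAttr]: records.length });
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) records.push({ ...baseRecord, [textAttr]: current, [orderAttr]: records.length });
  return records;
}
```

With maxRecordBytes: 10, the text 'aaaa bbbb cccc' splits into two records: { text: 'aaaa bbbb', part: 0 } and { text: 'cccc', part: 1 }.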

codeSnippets

Use this helper to extract code snippets from pages. The helper finds code snippets by looking for <pre> tags and extracting the content and the language class prefix from the tag.
If the crawler finds several code snippets on a page, the helper returns a list of those snippets.
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  // Call the helper; pass options only if you need to override its defaults.
  const code = helpers.codeSnippets();
  return { code };
}
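Snippets are usually indexed alongside other page content. A sketch combining this helper with the page helper (the code attribute name is a hypothetical choice, not required by the crawler):

```javascript
recordExtractor: ({ url, $, helpers }) => {
  const record = helpers.page({
    url,
    $,
    recordProps: { title: 'head title', content: 'body' },
  });
  // Attach any code snippets found on the page to the record.
  return { ...record, code: helpers.codeSnippets() };
}
```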

Response

The helper returns an array of code objects with the following properties:
content
string
The code snippet.
languageClassPrefix?
string
The code snippet’s language (if found).
codeUrl?
string
The URL of the nearest sibling <a> tag.
fragmentUrl?
string
A text fragment URL for the code snippet. A text fragment is a selection of text within a page that can be linked to directly.

docsearch

This helper extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy. You can also use it without DocSearch or to index non-documentation content. For more information, see the DocSearch documentation.
JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}